Optimal-Time Text Indexing in BWT-runs Bounded Space
Authors
Abstract
Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time O(m + occ) within O(r log(n/r)) space, on a RAM machine with words of w = Ω(log n) bits. Raising the space to O(r w log_σ(n/r)), we support locate in O(m log(σ)/w + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and efficiently extracts any text substring, with an O(log(n/r)) additive time penalty over the optimum. Preliminary experiments show that our new structure outperforms the alternatives by orders of magnitude in the space/time tradeoff map.
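To make the measure r and the counting operation concrete, here is a minimal Python sketch. It is not the index described in the abstract: it builds the BWT naively from sorted rotations, stores it uncompressed, and answers rank queries by scanning, so it uses neither O(r) space nor loglogarithmic time per symbol. The function names (`bwt`, `count_runs`, `count_occurrences`) and the `$` sentinel are assumptions made for illustration; the sentinel is assumed to be absent from the text and smaller than every text symbol.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations (illustration only, not efficient)."""
    # Assumption: the sentinel does not occur in the text and sorts first.
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(bwt_string):
    """r: the number of maximal runs of equal symbols in the BWT."""
    return sum(1 for _ in groupby(bwt_string))


def count_occurrences(bwt_string, pattern):
    """Count pattern occurrences by backward search over the BWT."""
    # first_row[c] = number of symbols in the text strictly smaller than c.
    first_row, total = {}, 0
    for c in sorted(set(bwt_string)):
        first_row[c] = total
        total += bwt_string.count(c)
    lo, hi = 0, len(bwt_string)  # half-open range of sorted rotations
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        # rank(c, i) is computed here by scanning the prefix bwt_string[:i].
        lo = first_row[c] + bwt_string[:lo].count(c)
        hi = first_row[c] + bwt_string[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
    print(count_occurrences(transformed, "ana"))
```

On a repetitive input such as `"abababab"`, the BWT groups equal symbols together, so r stays small while the text length grows; that is the property the space bounds in terms of r rely on.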
∗Partially funded by Basal Funds FB0001, Conicyt, by Fondecyt Grants 1-170048 and 1-171058, Chile, and by the Danish Research Council DFF-4005-00267.
†EIT, Diego Portales University, and Center for Biotechnology and Bioengineering (CeBiB), Chile, [email protected]
‡Department of Computer Science, University of Chile, and Center for Biotechnology and Bioengineering (CeBiB), Chile, [email protected]
§DTU Compute, Technical University of Denmark, Denmark, [email protected]
Similar resources
Fast Locating with the RLBWT
Indexing highly repetitive texts — such as genomic databases, software repositories and versioned text collections — has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) ...
Entropy-Compressed Indexes for Multidimensional Pattern Matching
In this talk, we will discuss the challenges involved in developing multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...
Burrows-Wheeler transform and LCP array construction in constant space
In this article we extend the elegant in-place Burrows-Wheeler transform (BWT) algorithm proposed by Crochemore et al. (2015). Our extension is twofold: we first show how to compute simultaneously the longest common prefix (LCP) array as well as the BWT, using constant additional space; we then show how to build the LCP array directly in compressed representation using Elias ...
Parameterized Pattern Matching - Succinctly
The fields of succinct data structures and compressed text indexing have seen quite a bit of progress over the last 15 years. An important achievement, primarily using techniques based on the Burrows-Wheeler Transform (BWT), was obtaining the full functionality of the suffix tree in the optimal number of bits. A crucial property that allows the use of BWT for designing compressed indexes is order-p...
Compressed indexing and local alignment of DNA
MOTIVATION: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings such as the human genome in the main memory. For example, a BWT index for the human genome (with about 3 billion characters) occupies just around 1 GB. However, these indexes are designed for exact pattern matching, which is too stringent for bi...